Abstract. In this paper we use the ternary representation of numbers to compress text data. We use a binary map for ternary digits and introduce a way to use the binary 11-pair, which has never been used for coding data before. We further use a four-digit ternary representation of the alphabet, with lowercase and uppercase letters and some extra symbols that are commonly used in day-to-day life. We find a way to minimize the length of the bit string, which is only possible in the ternary representation, thus drastically reducing the length of the code. We also find a connection between this coding technique and the Fibonacci numbers.
Introduction

Ternary, or trinary, is the base-3 numeral system. A ternary digit, or trit, contains about 1.58496 ($\log_2 3$) bits of information. Even though ternary most often refers to a system in which the three digits 0, 1, and 2 are all nonnegative integers, the adjective also lends its name to the balanced ternary system, which uses -1, 0, and +1 instead and is used in comparison logic and ternary converters [1][2][3].

Semiadaptive techniques have been developed for text document compression, which use a frequency-ordered array of word-number mappings. Encoding binary digital data in ternary form has applicability to digital data communication systems and magnetic data storage systems jH] and also to high-speed binary multipliers, which use ternary representations of two numbers [S]. Methods have been developed for converting binary signals into shorter balanced ternary code signals m-

Variable-length coding is a widely used method in data compression, especially in video data communication and storage applications such as JPEG, MPEG, and CCITT H.261. Most of those methods implement the coding with two-field codes. Hsieh [10] introduced a method using a three-field representation for each code. Ternary systems can also be adopted for difference coding in audio compression [11][12]. The nodes of a Peano curve can be represented by base-3 reflected Gray codes (RGC) [13].

Ternary systems have also been used in digital data recording systems [14] and in precoded ternary data transmission systems [15], which can send and store more data than their binary counterparts.

Digital data compression is an important tool because it can be utilized, for example, to reduce the 
storage requirements for files, to increase the rate at which data can be transferred over bandwidth 
limited communication channels, and to reduce the internal redundancy of data prior to its encryption 
in order to provide increased security. 

In this paper we use the standard ternary representation for coding data. We introduce a new representation called Base B23, which uses both ternary and binary representations and makes maximum use of the features of both in a single coded datum. Before going further, let us compare the binary and ternary representations of a few numbers:

$85_{10} = 1010101_2 = 10011_3$ and $150_{10} = 10010110_2 = 12120_3$.

Thus the binary representation of 85 uses 7 bits while the ternary representation uses only 5 trits, and 150 uses 8 bits but again only 5 trits, which is 37.5% fewer digits. Here we also notice that the 12-pair occurs twice in the ternary representation of 150. Thus, if we can find a representation that codes the 12-pair compactly, we can save more memory when storing such data and more time when sending it to another detector. Our main goal is therefore to develop a system that uses the ternary representation in a sophisticated manner. That is what we do in the rest of this paper.
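The digit counts compared above can be checked with a few lines of Python. This is a minimal sketch; the helper name `to_base` is introduced here for illustration and is not part of the paper.

```python
def to_base(n: int, base: int) -> str:
    """Return the digits of a non-negative integer n in the given base (2..10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits))


for n in (85, 150):
    binary, ternary = to_base(n, 2), to_base(n, 3)
    # Print each number with its binary and ternary digit strings and lengths.
    print(n, binary, len(binary), ternary, len(ternary))
```

Note that a trit carries more information than a bit, so fewer digits only translates into fewer stored bits once the trits are themselves mapped to bits efficiently, which is the point of the definitions that follow.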
The Basic Definitions
We begin with the following two definitions.

Definition 1. Let $n = e_1 e_2 e_3 \ldots e_k$, where $e_i \in \{0, 1, 2\}$ and $k \in \mathbb{N}$, be the ternary representation of the decimal number $n$. Then we say $n$ is in the form A23 if there is a map $\psi : \{0, 1, 2\} \to \{00, 01, 10\}$ such that

$$\psi(e_i) = \begin{cases} 00, & \text{if } e_i = 0; \\ 01, & \text{if } e_i = 1; \\ 10, & \text{otherwise,} \end{cases}$$

for each $i = 1, 2, 3, \ldots, k$.



With this definition, we can write $85_{10} = 10011_3 = 0100000101_{A23}$ and $150_{10} = 12120_3 = 0110011000_{A23}$. Thus the A23 form doubles the length of the string that represents the number. But in this format we waste the 11 pair. So we need to modify the coding so that we can make use of the binary string $11_2$. Thus we come up with the following definition of the so-called Base B23.
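The A23 map of Definition 1 is a plain digit-by-digit substitution, which can be sketched as follows. The names `A23_MAP` and `to_a23` are hypothetical helpers chosen for this illustration.

```python
# Each trit is replaced by a fixed pair of bits; the pair "11" is never produced.
A23_MAP = {"0": "00", "1": "01", "2": "10"}


def to_a23(ternary: str) -> str:
    """Map a ternary digit string to its A23 bit string (Definition 1)."""
    return "".join(A23_MAP[trit] for trit in ternary)


print(to_a23("12120"))  # ternary representation of 150
```

Because every trit costs exactly two bits, the output is always twice as long as the ternary input, which is the waste the next definition addresses.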



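Definition 2. Let $n = e_1 e_2 e_3 \ldots e_k$, where $e_i \in \{0, 1, 2\}$ and $k \in \mathbb{N}$, be the ternary representation of the decimal number $n$. Then we say $n = \varepsilon_1 \varepsilon_2 \varepsilon_3 \ldots \varepsilon_l$, where $\varepsilon_j \in \{00, 01, 10, 11\}$ and $l \le k$, is in Base B23 if there is a map $\psi$ from the trits $\{0, 1, 2\}$ to $\{00, 01, 10, 11\}$ such that $\psi(e_i) = \varepsilon_j$, where

$$\psi(e_i) = \begin{cases} 00, & \text{if } e_i = 0; \\ 01, & \text{if } e_i = 1 \text{ and } e_{i+1} \neq 2; \\ 10, & \text{if } e_i = 2 \text{ and } e_{i-1} \neq 1; \\ 11, & \text{if } e_i e_{i+1} = 12, \end{cases}$$

for each $i = 1, 2, 3, \ldots, k$.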

That is, we say a number $n$ is in Base B23 if the trits of its ternary representation are replaced by the following binary pairs, in that order:

$0 \mapsto 00$,

$12 \mapsto 11$,

$1 \mapsto 01$,

$2 \mapsto 10$.
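The replacement rules above can be sketched as a left-to-right scan that consumes a 12-pair whenever one starts at the current position. The function names `to_b23` and `from_b23` are assumptions made for this example; the inverse simply reads the bit string two bits at a time.

```python
SINGLE = {"0": "00", "1": "01", "2": "10"}
PAIRS = {"00": "0", "01": "1", "10": "2", "11": "12"}


def to_b23(ternary: str) -> str:
    """Encode a ternary digit string in Base B23 (12 -> 11, otherwise trit by trit)."""
    out = []
    i = 0
    while i < len(ternary):
        if ternary.startswith("12", i):
            out.append("11")  # one bit pair stands for two trits
            i += 2
        else:
            out.append(SINGLE[ternary[i]])
            i += 1
    return "".join(out)


def from_b23(bits: str) -> str:
    """Decode a B23 bit string back to ternary digits."""
    return "".join(PAIRS[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(to_b23("12120"))  # ternary representation of 150
```

Since two 12-pairs can never overlap, the greedy scan agrees with the case-by-case definition, and decoding is unambiguous because every output symbol is exactly two bits wide.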

So we can write $85_{10} = 0100000101_{A23} = 0100000101_{B23}$ and $150_{10} = 0110011000_{A23} = 111100_{B23}$. Thus the B23 form drastically reduces the length of the bit string if the pair $12_3$ is present in the coding. The more $12_3$ pairs are present, the more compact the code can be. Therefore, it is natural to ask how often $12_3$ pairs occur in the ternary representation of data. Thus we have the following lemma.

Lemma (Golden Lemma). The number $S_n$ of strings of trits 0, 1 and 2 of length $n$ with no 12 pairs is

$$S_n = \frac{1}{\sqrt{5}}\left(\phi^{2n+2} - \phi^{-2n-2}\right) \quad \text{for } n \in \mathbb{N},$$

where $\phi = \frac{1+\sqrt{5}}{2}$ is the Golden Ratio.



Proof. We prove this result using a recurrence relation. First consider the following construction of strings of 0, 1 and 2 of length $n$.

Figure 1: Deriving Formula

According to the diagram we have

$$S_n = 2S_{n-1} + S_{n-2} + S_{n-3} + \cdots + S_2 + S_1 + 2,$$

with $S_1 = 3$ and $S_2 = 8$, for $n \ge 3$. Subtracting the same relation written for $S_{n-1}$, this reduces to $S_n = 3S_{n-1} - S_{n-2}$ with $S_1 = 3$ and $S_2 = 8$ for $n \ge 3$. Solving the recurrence relation with the given data, we have

$$S_n = \frac{1}{\sqrt{5}}\left[\left(\frac{1+\sqrt{5}}{2}\right)^{2n+2} - \left(\frac{1-\sqrt{5}}{2}\right)^{2n+2}\right] = \frac{1}{\sqrt{5}}\left(\phi^{2n+2} - \phi^{-2n-2}\right), \quad \text{where } \phi = \frac{1+\sqrt{5}}{2}.$$

□



The first few terms of this sequence are $S_1 = 3$, $S_2 = 8$, $S_3 = 21$, $S_4 = 55$, $S_5 = 144$. These are the even-indexed terms of the Fibonacci sequence defined by $F_n = F_{n-1} + F_{n-2}$ with $F_1 = 2$, $F_2 = 3$.
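As a sanity check on the lemma, the count can be computed in three independent ways: by brute-force enumeration, by the recurrence $S_n = 3S_{n-1} - S_{n-2}$, and by the closed form. The sketch below uses function names invented for this illustration.

```python
from itertools import product
from math import sqrt


def count_brute_force(n: int) -> int:
    """Count ternary strings of length n that contain no '12' pair."""
    return sum("12" not in "".join(s) for s in product("012", repeat=n))


def count_recurrence(n: int) -> int:
    """S_n from S_n = 3*S_{n-1} - S_{n-2}, with S_1 = 3 and S_2 = 8."""
    a, b = 3, 8
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, 3 * b - a
    return b


def count_closed_form(n: int) -> int:
    """S_n from the golden-ratio formula, rounded to the nearest integer."""
    phi = (1 + sqrt(5)) / 2
    return round((phi ** (2 * n + 2) - phi ** (-2 * n - 2)) / sqrt(5))


for n in range(1, 9):
    print(n, count_brute_force(n), count_recurrence(n), count_closed_form(n))
```

The brute-force count grows as $3^n$, so it is only practical for small $n$; the closed form relies on floating-point arithmetic and is rounded accordingly.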



Now, consider a uniformly distributed ternary string of 0, 1, and 2 of length $n$. Then the following holds:

Theorem. The probability that at least one 12-pair appears in a string of 0, 1, and 2 of length $n$ is asymptotically 1.

Proof. If $S_n$ is the number of strings of trits 0, 1 and 2 of length $n$ with no 12-pairs, then the probability of getting at least one 12-pair is

$$P_n = 1 - \frac{S_n}{3^n}.$$

Therefore,

$$\lim_{n \to \infty} P_n = 1 - \lim_{n \to \infty} \frac{\phi^{2n+2} - \phi^{-2n-2}}{3^n \sqrt{5}} = 1,$$

since $\phi^2 \approx 2.618 < 3$.

□



This shows that as $n$ gets larger, the mapping $12 \mapsto 11$ should efficiently reduce the length of the B23 string.
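To get a feel for how quickly this probability approaches 1, one can evaluate $P_n = 1 - S_n / 3^n$ numerically. This is a small sketch under the uniform-distribution assumption of the theorem; the function names are hypothetical.

```python
def count_no_12(n: int) -> int:
    """S_n via the recurrence S_n = 3*S_{n-1} - S_{n-2}, S_1 = 3, S_2 = 8."""
    a, b = 3, 8
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, 3 * b - a
    return b


def prob_at_least_one_12(n: int) -> float:
    """Probability that a uniformly random ternary string of length n contains '12'."""
    return 1 - count_no_12(n) / 3 ** n


for n in (2, 4, 10, 20, 50, 100):
    print(n, round(prob_at_least_one_12(n), 6))
```

The convergence is geometric with ratio $\phi^2 / 3 \approx 0.873$, so short strings such as four-trit codes contain a 12-pair only in a minority of cases.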

Data compression with B23 

Here we discuss one of the main uses of B23 in coding day-to-day data. Among the applications that benefit from converting data into binary form, word processing takes the lead. Thus in this paper we discuss a technique that could drastically increase the capacity for storing and sending data using the so-called B23 ternary coding.

Algorithm. Let $A = \{a_1, a_2, a_3, \ldots, a_n\}$ be an alphabet, with $p(a_i)$ being the probability of $a_i$ appearing in the language generated by $A$. Without loss of generality, suppose $p(a_1) \ge p(a_2) \ge \cdots \ge p(a_n)$. Let $B = \{b_1, b_2, b_3, \ldots, b_n\}$ be a sequence of integers written in ternary form. Define $\theta : B \to \mathbb{Z}$ by

$\theta(b_i) = j$, where $b_i = \ldots 12_1 \ldots 12_2 \ldots 12_j \ldots$,

that is, $\theta(b_i)$ is the number of 12-pairs in $b_i$. Suppose without loss of generality that $\theta(b_1) \ge \theta(b_2) \ge \cdots \ge \theta(b_n)$. Then we define a map $\Omega : A \to B$ such that $\Omega(a_i) = b_i$ for $i = 1, 2, \ldots, n$.

We notice that this scheme generates a coding for the language generated by $A$. Now we turn to a more practical example: the English alphabet and the language generated by it.
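The assignment described in the algorithm can be sketched as two sorts followed by a pairing: symbols by decreasing probability, codewords by decreasing number of 12-pairs. The function `assign_codes` and its parameters are illustrative assumptions, and ties are broken by the default enumeration order rather than by any rule stated in the paper.

```python
from itertools import product


def assign_codes(freqs: dict, width: int) -> dict:
    """Pair the most frequent symbols with the ternary codewords richest in '12' pairs."""
    codes = ["".join(t) for t in product("012", repeat=width)]
    if len(freqs) > len(codes):
        raise ValueError("not enough codewords of this width")
    # '12' pairs cannot overlap, so str.count gives the number of pairs.
    codes.sort(key=lambda c: c.count("12"), reverse=True)  # stable sort
    symbols = sorted(freqs, key=freqs.get, reverse=True)
    return dict(zip(symbols, codes))


table = assign_codes({"e": 12.702, "t": 9.056, "z": 0.074}, width=2)
print(table)
```

The character map proposed later in the paper is built by exchanging alphabet positions by hand, so it need not coincide with what this generic sketch produces.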

Example. Here we find a B23 code for the 26-letter English alphabet, which is different from the Huffman coding of the 26-letter alphabet and also from the ternary Huffman coding ^TT] [TB] [T7] [TB]. We also include some extra characters that are frequently used in word processing. Before going further, we need to observe some facts about the usage of the English alphabet. When we use a language, we use some letters more frequently than others. In English in particular, the letter 'e' has by far the highest frequency of all the characters in the alphabet, while the letter 'z' has the lowest. Some of these facts are summarized in the following table, accompanied by two charts.¹

[Figure: two bar charts of letter frequency (%) by letter, one in alphabetical order (a to z) and one in decreasing order of frequency (e t a o i n s h r d l c u m w f g y p b v k j x q z).]

Figure 2: Bar-chart of frequency order of 26-letter English alphabet



According to the table below, the letters e, t, a, o, i, n, s, h, and r have comparatively higher frequencies than the characters p, b, v, k, j, x, q, and z. Therefore, if we use a shorter code for the high-frequency letters and a comparatively longer code for the low-frequency characters, we can reduce the length of the coded sequence. This is the goal of the rest of the paper.



Letter   Frequency (%)     Letter   Frequency (%)
a         8.167            n         6.749
b         1.492            o         7.507
c         2.782            p         1.929
d         4.253            q         0.095
e        12.702            r         5.987
f         2.228            s         6.327
g         2.015            t         9.056
h         6.094            u         2.758
i         6.966            v         0.978
j         0.153            w         2.360
k         0.772            x         0.150
l         4.025            y         1.974
m         2.406            z         0.074

Table 1: Frequency order of 26-letter English alphabet



¹ Cryptological Mathematics, Robert Edward Lewand, MAA, Washington DC, 2000.
The above table forces us to reorder the alphabet so that we use a short code for high-frequency letters and a longer code for low-frequency letters. Before doing this, we have to notice a few things that are characteristic of the English alphabet. One major point is that the alphabet has only 26 characters. Thus with a ternary coding three trits suffice to code all 26 characters ($3^3 = 27$), whereas a binary coding needs 5 bits ($2^4 = 16 < 26 \le 32 = 2^5$). This is the major advantage of the ternary system over the binary system here.

3 Exchanged Character Map - 3ECM 

To represent high-frequency characters with short codes, we simply exchange the positions of letters in the English alphabet. The letters e, t, a, o, ... get shorter codes, while the letters v, k, x, q, and z get comparatively longer codes, as in Huffman encoding. What remains is to consider which uppercase letters appear most frequently as the first letter of a word. The ten letters that occur most frequently at the beginning of words, with their frequencies, are:



Letter          T       A      I      S      O      C      M      F      P     W
Frequency (%)   15.94   15.5   8.23   7.75   7.12   5.97   4.26   4.08   4.0   3.82

Table 2: Frequency of the first letters

Clearly, the order differs from that for lowercase letters (cf. t and e vs. T and E). Thus we propose the following character map, with a few extra symbols and the accompanying ternary codes:



Dec  Symbol  Ternary     Dec  Symbol  Ternary     Dec  Symbol  Ternary
  0  W       0000         27  z       1000         54  !       2000
  1  N       0001         28  p       1001         55  $       2001
  2  B       0002         29  b       1002         56          2002
  3  C       0010         30  w       1010         57  %       2010
  4  D       0011         31  x       1011         58          2011
  5  T       0012         32  e       1012         59          2012
  6  F       0020         33  f       1020         60  *       2020
  7  G       0021         34  g       1021         61  /       2021
  8  H       0022         35  v       1022         62          2022
  9  P       0100         36  q       1100         63  <       2100
 10  J       0101         37  j       1101         64  >       2101
 11  K       0102         38  k       1102         65  @       2102
 12  L       0110         39  y       1110         66  &       2110
 13  M       0111         40  m       1111         67  ;       2111
 14  A       0112         41  n       1112         68          2112
 15  O       0120         42  o       1120         69  ?       2120
 16  I       0121         43  a       1121         70  (       2121
 17  S       0122         44  i       1122         71  )       2122
 18  R       0200         45  r       1200         72  {       2200
 19  Q       0201         46  s       1201         73  }       2201
 20  E       0202         47  t       1202         74  [       2202
 21  U       0210         48  u       1210         75  ]       2210
 22  V       0211         49  h       1211         76  \       2211
 23  .       0212         50  Space   1212         77          2212
 24  X       0220         51  d       1220         78          2220
 25  Y       0221         52  l       1221         79  +       2221
 26  Z       0222         53  c       1222         80  -       2222

Table 3: Coding Table
In our table the ternary codes of almost all of the most frequent letters contain at least one 12 pair. Applying the B23 scheme results in the following code table.

According to the table, what we achieve is that high-frequency characters have shorter codes than the others. Let us exemplify the method. For that we use a test string and code it in two ways, using B23 and the standard ASCII code, and then compare the two bit strings generated by these techniques. This is done in the following way. Algorithm 1 converts the text directly into B23 code, and Algorithms 2 and 3 convert it back into human-readable characters.



Symbol  Ternary  B23          Symbol  Ternary  B23          Symbol  Ternary  B23
W       0000     00000000     z       1000     01000000     !       2000     10000000
N       0001     00000001     p       1001     01000001     $       2001     10000001
B       0002     00000010     b       1002     01000010             2002     10000010
C       0010     00000100     w       1010     01000100     %       2010     10000100
D       0011     00000101     x       1011     01000101             2011     10000101
T       0012     000011       e       1012     010011               2012     100011
F       0020     00001000     f       1020     01001000     *       2020     10001000
G       0021     00001001     g       1021     01001001     /       2021     10001001
H       0022     00001010     v       1022     01001010             2022     10001010
P       0100     00010000     q       1100     01010000     <       2100     10010000
J       0101     00010001     j       1101     01010001     >       2101     10010001
K       0102     00010010     k       1102     01010010     @       2102     10010010
L       0110     00010100     y       1110     01010100     &       2110     10010100
M       0111     00010101     m       1111     01010101     ;       2111     10010101
A       0112     000111       n       1112     010111               2112     100111
O       0120     001100       o       1120     011100       ?       2120     101100
I       0121     001101       a       1121     011101       (       2121     101101
S       0122     001110       i       1122     011110       )       2122     101110
R       0200     00100000     r       1200     110000       {       2200     10100000
Q       0201     00100001     s       1201     110001       }       2201     10100001
E       0202     00100010     t       1202     110010       [       2202     10100010
U       0210     00100100     u       1210     110100       ]       2210     10100100
V       0211     00100101     h       1211     110101       \       2211     10100101
.       0212     001011       Space   1212     1111                 2212     101011
X       0220     00101000     d       1220     111000               2220     10101000
Y       0221     00101001     l       1221     111001       +       2221     10101001
Z       0222     00101010     c       1222     111010       -       2222     10101010

Table 4: Complete Coding Table

Algorithm 1. Coding into B23 form. Here we assume that the text string consists only of the characters listed in the above table.

Let U_ij := {(W, 0000), (N, 0001), (B, 0002), ..., (-, 2222)}, i = 1, 2, ..., 81; j = 1, 2.

Input: text string S. Let l = length(S); NewString = [].

for i = 1 to l do

    find k such that U_k1 = characterAt(S, i);

    NewString = NewString + (B23 code of U_k2),

end do.
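Algorithm 1 can be sketched in Python as a table lookup followed by the B23 substitution. The dictionary below is only an excerpt of the coding table, restricted to the characters of the test sentence used in this paper; taking the full stop as the symbol for code 0212 is an assumption inferred from that example's coded string, and all function names are hypothetical.

```python
# Excerpt of the character map (symbol -> four-trit code); not the full table.
# Assumption: code 0212 is taken to be the full stop.
TERNARY_CODE = {
    "T": "0012", ".": "0212", "e": "1012", "g": "1021", "m": "1111",
    "a": "1121", "i": "1122", "s": "1201", "t": "1202", "h": "1211",
    " ": "1212",
}
SINGLE = {"0": "00", "1": "01", "2": "10"}


def ternary_to_b23(trits: str) -> str:
    """Replace each 12-pair by '11' and every other trit by its two-bit code."""
    out, i = [], 0
    while i < len(trits):
        if trits.startswith("12", i):
            out.append("11")
            i += 2
        else:
            out.append(SINGLE[trits[i]])
            i += 1
    return "".join(out)


def encode_b23(text: str) -> str:
    """Encode text character by character; raises KeyError for unlisted symbols."""
    return "".join(ternary_to_b23(TERNARY_CODE[ch]) for ch in text)


message = "This is the test message."
coded = encode_b23(message)
print(len(coded), "bits in B23 versus", 8 * len(message), "bits in 8-bit ASCII")
```

Each character is converted on its own, so a 12-pair that straddles two characters is not merged; this keeps every character's code length fixed by the table.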

Once the string is received at the destination, we use the following two algorithms to decode it: the first transforms it back into ternary, and the second decodes it into human-readable characters.

Algorithm 2. Converting into ternary.

Input: received string S. Let l = length(S); NewString = [].

for i = 1 to l step 2 do

    if Substring(S, i, i+1) = '00' then NewString = NewString + '0',
    elseif Substring(S, i, i+1) = '01' then NewString = NewString + '1',
    elseif Substring(S, i, i+1) = '10' then NewString = NewString + '2',
    else NewString = NewString + '12',

end do.

Now we read this string in groups of four characters and assign each group to a single character according to the above table. So we use the following algorithm.

Algorithm 3. Decoding ternary into human-readable characters.

Let U_ij := {(W, 0000), (N, 0001), (B, 0002), ..., (-, 2222)}, i = 1, 2, ..., 81; j = 1, 2.

Input: ternary string S. Let l = length(S); NewString = [].

for i = 1 to l step 4 do

    if U_k2 = Substring(S, i, i+3) for some k then

        NewString = NewString + U_k1,

    else NewString = NewString + [],

end do.
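Algorithms 2 and 3 can be combined into a short decoder: read the bit string two bits at a time to recover the trits, then cut the trits into groups of four and look each group up. As before, the table is only an excerpt covering the test sentence of this paper, the full stop for code 0212 is an assumption, and the function names are invented for this sketch.

```python
# Excerpt of the character map (symbol -> four-trit code); not the full table.
# Assumption: code 0212 is taken to be the full stop.
TERNARY_CODE = {
    "T": "0012", ".": "0212", "e": "1012", "g": "1021", "m": "1111",
    "a": "1121", "i": "1122", "s": "1201", "t": "1202", "h": "1211",
    " ": "1212",
}
SYMBOL_OF = {code: ch for ch, code in TERNARY_CODE.items()}
TRITS_OF = {"00": "0", "01": "1", "10": "2", "11": "12"}


def b23_to_ternary(bits: str) -> str:
    """Algorithm 2: map each aligned bit pair back to one or two trits."""
    return "".join(TRITS_OF[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_b23(bits: str) -> str:
    """Algorithm 3: split the trits into groups of four and look up each symbol."""
    trits = b23_to_ternary(bits)
    return "".join(SYMBOL_OF[trits[i:i + 4]] for i in range(0, len(trits), 4))


coded = (
    "0000111101010111101100011111011110110001111111001011010101001111111100100"
    "1001111000111001011110101010101001111000111000101110101001001010011001011"
)
print(decode_b23(coded))
```

No separators between characters are needed: every bit pair decodes to a known number of trits, and every character occupies exactly four trits.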

Upper bound for the compression ratio. To derive an upper bound for the compression ratio, we have to make a few assumptions, since this compression is dynamic. Let us assume that in a longer text the space has the highest frequency. In order to get a numerical value, we assume that the probability of a space is 50% and that each other letter $U_i$ has probability $p_i$ equal to half of its table value. Then

$$\text{Compression Ratio} = \frac{\text{Length of compressed text}}{\text{Length of uncompressed text}} = \frac{\sum_i (\text{length of letter after compression}) \times p_i}{\sum_i (\text{length of letter before compression}) \times p_i} < \frac{4 \times 100 + 6 \times (8.167 + \cdots + 2.758) + 8 \times (1.492 + \cdots + 0.074)}{8 \times 200} = 0.64577.$$

We can achieve much stronger compression when this scheme is applied together with Huffman encoding. We can use the B23 scheme directly once technology develops to a level where ternary strings can be used for data transmission.
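The numerical value of this bound can be reproduced from the letter-frequency table. The sketch below assumes, as in the text, a 4-bit space with weight 100, lowercase letters only, 8-bit ASCII as the baseline, and 6-bit B23 codes for exactly those letters whose four-trit code contains a 12-pair (e, n, o, a, i, r, s, t, u, h, d, l, c); the variable names are introduced here for illustration.

```python
# Letter frequencies in percent, as listed in the frequency table.
FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}
# Lowercase letters whose code contains one 12-pair and so takes 6 bits.
SIX_BIT = set("enoairstuhdlc")

# Space: 4 bits with weight 100; letters share the remaining weight of 100.
compressed = 4 * 100 + sum((6 if ch in SIX_BIT else 8) * f for ch, f in FREQ.items())
uncompressed = 8 * 200
ratio = compressed / uncompressed
print(round(ratio, 5))
```

The result depends strongly on the assumed 50% share of spaces; with a smaller share the ratio moves toward the 6-to-8-bit range of the letter codes alone.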

Now we apply the above algorithms to the following example and compare the results with the familiar ASCII encoding. We notice that even for a short text there is a noticeable reduction in the length of the coded string.

Example. Consider the text string

S = "This is the test message."
Once we code it into B23, we get

CodedString = 0000111101010111101100011111011110110001111111001011010101001111111100100 
1001111000111001011110101010101001111000111000101110101001001010011001011 

This can be compared with the corresponding binary string generated by 8-bit ASCII coding, which at 200 bits is about 37% longer than the 146-bit B23 string:

BinaryString = 01010100011010000110100101110011001000000110100101110011001000000111010001

1010000110010100100000011101000110010101110011011101000010000001101101011 
00101011100110111001101100001011001110110010100101110 



After applying Algorithms 2 and 3, we recover

DecodedString = "This is the test message."
Thus if we adopt this technique in word processing and data compression, we can drastically reduce the memory needed to store information and the volume of data to be transmitted.

We can extend this technique to a six-digit ternary system with more characters than in this case. That would be the next task. We can also extend this technique, with slight modifications, to compressing highly randomized data piT. We conclude the paper with the following question.

Question: What kind of distributions have more 12-pairs in a string of 0, 1, and 2?

Acknowledgement. The author would like to express his heartfelt gratitude to Dr. Jerzy Kocik of the Department of Mathematics at Southern Illinois University for his invaluable suggestions and continued support.